AITopics | action head

Collaborating Authors

action head

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models

Neural Information Processing SystemsJun-22-2026, 17:43:01 GMT

Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and components to be further improved. To systematically investigate the impacts of different planning paradigms and representations isolating from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves superior or comparable performance than other paradigms on task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
(2 more...)

Add feedback

AGUVIS-7BUI-TARS-7BOS-Atlas-7BUGround-7BSeeClick+VerifierGUI-Actor-7BUI-TARS-2BGUI-Actor-2BShowUI-2BAriaUI-3.9BUGround-2B+Verifier

Neural Information Processing SystemsJun-15-2026, 02:34:04 GMT

One of the principal challenges in building VLM-powered GUI agents is visual grounding--localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as singlepoint predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actorintroduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actoroutperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B achieves scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones, outperforming UI-TARS72B (38.1) on ScreenSpot-Pro, with significantly fewer parameters and training data. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head ( 100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.

arxiv preprint arxiv, large language model, machine learning, (21 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

BAKU: AnEfficientTransformerfor Multi-TaskPolicyLearning

Neural Information Processing SystemsFeb-18-2026, 20:35:36 GMT

Inthiswork,wepresentBAKU,asimple transformer architecture that enables efficient learning of multi-task robot policies.BAKU builds upon recent advancements in offline imitation learning and meticulously combines observation trunks, action chunking, multi-sensory observations, and action heads tosubstantially improveupon prior work.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)

Add feedback

67b0e7c7c2a5780aeefe3b79caac106e-Paper-Conference.pdf

Neural Information Processing SystemsFeb-15-2026, 13:03:18 GMT

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Shikoku > Kagawa Prefecture > Takamatsu (0.04)
Asia > China (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

SAM2Grasp: Resolve Multi-modal Grasping via Prompt-conditioned Temporal Action Prediction

Wu, Shengkai, Yang, Jinrong, Luo, Wenqiu, Gao, Linfeng, Shang, Chaohui, Zhi, Meiyu, Sun, Mingshan, Yang, Fangping, Ren, Liangliang, Zhao, Yong

arXiv.org Artificial IntelligenceDec-3-2025

Imitation learning for robotic grasping is often plagued by the multimodal problem: when a scene contains multiple valid targets, demonstrations of grasping different objects create conflicting training signals. Standard imitation learning policies fail by averaging these distinct actions into a single, invalid action. In this paper, we introduce SAM2Grasp, a novel framework that resolves this issue by reformulating the task as a uni-modal, prompt-conditioned prediction problem. Our method leverages the frozen SAM2 model to use its powerful visual temporal tracking capability and introduces a lightweight, trainable action head that operates in parallel with its native segmentation head. This design allows for training only the small action head on pre-computed temporal-visual features from SAM2. During inference, an initial prompt, such as a bounding box provided by an upstream object detection model, designates the specific object to be grasped. This prompt conditions the action head to predict a unique, unambiguous grasp trajectory for that object alone. In all subsequent video frames, SAM2's built-in temporal tracking capability automatically maintains stable tracking of the selected object, enabling our model to continuously predict the grasp trajectory from the video stream without further external guidance. This temporal-prompted approach effectively eliminates ambiguity from the visuomotor policy. We demonstrate through extensive experiments that SAM2Grasp achieves state-of-the-art performance in cluttered, multi-object grasping tasks.

artificial intelligence, arxiv preprint arxiv, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2512.02609

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Decoupled Action Head: Confining Task Knowledge to Conditioning Layers

Zhou, Jian, Lin, Sihao, Fu, Shuai, WU, Qi

arXiv.org Artificial IntelligenceNov-18-2025

Behavior Cloning (BC) is a data-driven supervised learning approach that has gained increasing attention with the success of scaling laws in language and vision domains. Among its implementations in robotic manipulation, Diffusion Policy (DP), with its two variants DP-CNN (DP-C) and DP-Transformer (DP-T), is one of the most effective and widely adopted models, demonstrating the advantages of predicting continuous action sequences. However, both DP and other BC methods remain constrained by the scarcity of paired training data, and the internal mechanisms underlying DP's effectiveness remain insufficiently understood, leading to limited generalization and a lack of principled design in model development. In this work, we propose a decoupled training recipe that leverages nearly cost-free kinematics-generated trajectories as observation-free data to pretrain a general action head (action generator). The pretrained action head is then frozen and adapted to novel tasks through feature modulation. Our experiments demonstrate the feasibility of this approach in both in-distribution and out-of-distribution scenarios. As an additional benefit, decoupling improves training efficiency; for instance, DP-C achieves up to a 41% speedup. Furthermore, the confinement of task-specific knowledge to the conditioning components under decoupling, combined with the near-identical performance of DP-C in both normal and decoupled training, indicates that the action generation backbone plays a limited role in robotic manipulation. Motivated by this observation, we introduce DP-MLP, which replaces the 244M-parameter U-Net backbone of DP-C with only 4M parameters of simple MLP blocks, achieving a 83.9% faster training speed under normal training and 89.1% under decoupling.

action head, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2511.12101

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

An Efficient Transformer for Multi-Task Policy Learning

Neural Information Processing SystemsOct-10-2025, 22:39:50 GMT

Videos of the robot are best viewed at baku-robot.github.io.

arxiv preprint arxiv, experiment, learning, (14 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution Y ang Yue

Neural Information Processing SystemsOct-10-2025, 04:49:39 GMT

MLLM based on each situation at hand.

arxiv preprint arxiv, language model, mllm, (15 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Shikoku > Kagawa Prefecture > Takamatsu (0.04)
Asia > China (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)

Add feedback

VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Lin, Juyi, Taherin, Amir, Akbari, Arash, Akbari, Arman, Lu, Lei, Chen, Guangyu, Padir, Taskin, Yang, Xiaomeng, Chen, Weiwei, Li, Yiqian, Lin, Xue, Kaeli, David, Zhao, Pu, Wang, Yanzhi

arXiv.org Artificial IntelligenceOct-6-2025

Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, current VLA models suffer from two drawbacks: (i) generation of massive tokens leading to high inference latency and increased training cost, and (ii) insufficient utilization of generated actions resulting in potential performance loss. To address these issues, we develop a training framework to finetune VLA models for generating significantly fewer action tokens with high parallelism, effectively reducing inference latency and training cost. Furthermore, we introduce an inference optimization technique with a novel voting-based ensemble strategy to combine current and previous action predictions, improving the utilization of generated actions and overall performance. Our results demonstrate that we achieve superior performance compared with state-of-the-art VLA models, achieving significantly higher success rates and 39$\times$ faster inference than OpenVLA with 46 Hz throughput on edge platforms, demonstrating practical deployability. The code is available at https://github.com/LukeLIN-web/VOTE.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.05116

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Add feedback

A Ablations

Neural Information Processing SystemsAug-15-2025, 23:29:18 GMT

We find that past play greatly stabilizes the emergence of reciprocity in IPD. In cells containing another agent, we include the RUSP observations in these channels. In Figure 11 we show results when training with RUSP in these environments. Consistent with past work, the greedy baseline fails to reach a solution with high collective return. We use a distributed computing infrastructure used in Berner et al.

action head, agent, prisoner, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.31)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.31)

Add feedback